Language-specific encoding in multilingual corpora: Requirements and solutions
Authors
Abstract
This is a special Internet edition of the article "Language-specific encoding in multilingual corpora: Requirements and solutions" by Jost Gippert (1999). It should not be cited. Citations should be taken from the original edition in Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der Gesellschaft für Linguistische Datenverarbeitung (ed. J. Gippert / P. Olivier), Praha 1999, 371-384.
Similar resources
Chapter 4 Character encoding in corpus construction
Corpus linguistics has developed, over the past three decades, into a rich paradigm that addresses a great variety of linguistic issues ranging from monolingual research of one language to contrastive and translation studies involving many different languages. Today, while the construction and exploitation of English language corpora still dominate the field of corpus linguistics, corpora of ot...
Standards & best practice for multilingual computational lexicons: ISLE MILE and more
ISLE (International Standards for Language Engineering) is a transatlantic standards-oriented initiative under the Human Language Technology (HLT) programme within the EU-US International Research Co-operation. It is a continuation of the European EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out through a number of subsequent projects funded by the Europ...
Against multilinguality
1. Introduction An obvious assumption of the present workshop is that multilingual corpora are useful, and should be built and investigated. In the present paper, I would like to point out that this is far from straightforward and actually remains to be proved. In addition, and in a more constructive vein, I want to present some examples that show that the right encoding depends crucially on wh...
Processing Annotated TMX Parallel Corpora
In recent years, the amount of freely available multilingual corpora has grown exponentially. Unfortunately, the way these corpora are made available is very diverse, ranging from simple text files or specific XML schemas to supposedly standard formats like the XML Corpus Encoding Initiative, the Text Encoding Initiative, or even the Translation Memory Exchange formats. In this documen...
Semantic-Based Multilingual Document Clustering via Tensor Modeling
A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue we propose a ne...